Skip to content

ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file#12812

Closed
niyue wants to merge 2 commits into
apache:masterfrom
niyue:bugfix/ipc-batch-meta
Closed

ARROW-16131 [C++] support saving and retrieving custom metadata in batches for IPC file#12812
niyue wants to merge 2 commits into
apache:masterfrom
niyue:bugfix/ipc-batch-meta

Conversation

@niyue

@niyue niyue commented Apr 6, 2022

Copy link
Copy Markdown
Contributor

This PR aims to address https://issues.apache.org/jira/projects/ARROW/issues/ARROW-16131

Currently, when writing an IPC file with multiple record batches using the arrow::ipc::RecordBatchWriter, the custom_metadata for each record batch is not saved and will be discarded silently.

The current writer does provide the AppendCustomMetadata API internally, but it is never called. I use this API to add the custom metadata during writing and retrieve the custom metadata when reading the record batch. I am not completely sure if this is intentionally ignored or accidentally missing, but from a user's perspective, the metadata can be set via public API (see the example case in ARROW-16131) and I think it is not very friendly that this piece of data is silently discarded.

Several test cases are added in the read_write_test.cc for verifying this change.

@github-actions

github-actions Bot commented Apr 6, 2022

Copy link
Copy Markdown

@github-actions

github-actions Bot commented Apr 6, 2022

Copy link
Copy Markdown

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@niyue

niyue commented Apr 6, 2022

Copy link
Copy Markdown
Contributor Author

I checked out all the build failures reported , including arrow-gcsfs-test (Timeout) and arrow-s3-test (Failed), and they don't seem to be related with my change.

I can build and run the S3 test locally, and it ran into segfault locally as the CI server did for the test case GetFileInfoRoot, but this seems not related with the PR change:

TEST_F(TestS3FS, GetFileInfoRoot) { AssertFileInfo(fs_.get(), "", FileType::Directory); }

For the gcs test, I failed to build the google cloud cpp lib and I am not able to run it locally so far, but it seems happened in other PR as well (for example, #12816)

Comment thread cpp/src/arrow/ipc/test_common.cc Outdated
Comment thread cpp/src/arrow/ipc/writer.cc Outdated
Comment thread cpp/src/arrow/ipc/reader.cc Outdated
Comment thread cpp/src/arrow/ipc/writer.cc Outdated
@niyue niyue force-pushed the bugfix/ipc-batch-meta branch from e337fa6 to 6320bd1 Compare April 8, 2022 05:53
@pitrou

pitrou commented Apr 12, 2022

Copy link
Copy Markdown
Member

Hmm, I don't think this is the right way to do this. The IPC message's custom_metadata is independent from record batch (i.e. schema) metadata, one should not influence the other.

(also, you could have a custom_metadata for other message types such as Tensor)

cc @wesm for opinions.

@niyue

niyue commented Apr 12, 2022

Copy link
Copy Markdown
Contributor Author

The IPC message's custom_metadata is independent from record batch (i.e. schema) metadata

This is actually my initial understanding, according to this documentation (https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata), it says This includes Field, Schema, and Message. and it seems to indicate schema metadata is different from message metadata.

But for the documentation here (https://arrow.apache.org/docs/python/data.html#custom-schema-and-field-metadata), it says Arrow supports both schema-level and field-level custom key-value metadata allowing for systems to insert their own application defined metadata to customize behavior. and record batch (message level) metadata is not mentioned.

And I only see API/tests allowing users to modify schema's metadata (probably I am missing something), so I assumed my initial understanding is incorrect and came up with such a PR change.

If message metadata is different from schema metadata, the question here may be slightly more complex since in this case, each record batch may provide 1) this record batch's schema specific metadata, this seems not serialized/deserialized 2) this record batch's message specific metadata, we don't seem to provide API for reading/writing this (internally we have AppendCustomMetadata but it is not called and there seems no higher level API allowing users to pass this information)

Please advice what the correct behavior should be so that I could give it a try. Thanks.

@pitrou

pitrou commented Apr 13, 2022

Copy link
Copy Markdown
Member

So, basically, per-message metadata is an Arrow IPC feature that's independent from record batches or schemas. That's why you don't see it mentioned in the schema metadata documentation.

If message metadata is different from schema metadata, the question here may be slightly more complex since in this case, each record batch may provide 1) this record batch's schema specific metadata, this seems not serialized/deserialized 2) this record batch's message specific metadata, we don't seem to provide API for reading/writing this

Indeed the message metadata wouldn't be attached to the record batch or schema, but passed separately.

One possible API on the write side:

class ARROW_EXPORT RecordBatchWriter {
 public:
  // (default implementation could call `WriteRecordBatch(batch)` while ignoring metadata?)
  virtual Status WriteRecordBatch(const RecordBatch& batch, const KeyValueMetadata& custom_metadata);

and on the read side:

class ARROW_EXPORT RecordBatchReader {
 public:
  // (default implementation could call `ReadNext(batch)` and reset `*custom_metadata` to null)
  virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch, std::shared_ptr<KeyValueMetadata>* custom_metadata);

@niyue

niyue commented Apr 14, 2022

Copy link
Copy Markdown
Contributor Author

One possible API on the write side

Thanks for the advice. Let me give it a try.

@niyue niyue force-pushed the bugfix/ipc-batch-meta branch from 6320bd1 to f28dfb3 Compare April 17, 2022 08:18
Comment thread cpp/src/arrow/ipc/reader.h Outdated
Comment thread cpp/src/arrow/record_batch.h Outdated
@niyue

niyue commented Apr 17, 2022

Copy link
Copy Markdown
Contributor Author

@pitrou @lidavidm
I submitted a new commit following the suggestion above. I tried to keep the changes compatible with existing public APIs so several new APIs are added. Could you please help to review? Let me know if there is any issue. Thanks.

For the CI:

  1. I found one test case 80 - arrow-compute-hash-join-node-test failed in C++ / AMD64 Conda C++ (pull_request), but I guess this is probably not caused by my PR
  2. another test case 51 - arrow-flight-test (Failed) failed in continuous-integration/appveyor/pr, I am not able to find the detailed test report in the CI job, is this an issue?

@pitrou

pitrou commented Apr 18, 2022

Copy link
Copy Markdown
Member

@niyue Indeed, these CI failures are unrelated. I'll let @lidavidm take a look at the Flight one; the hash-join-node-test crash is at https://issues.apache.org/jira/browse/ARROW-15361

@lidavidm

Copy link
Copy Markdown
Member

I filed ARROW-16215

Comment thread cpp/src/arrow/ipc/reader.h Outdated
Comment thread cpp/src/arrow/record_batch.h Outdated
Comment thread cpp/src/arrow/ipc/reader.cc Outdated
@niyue niyue force-pushed the bugfix/ipc-batch-meta branch from f28dfb3 to e65856a Compare April 20, 2022 02:23
Comment thread cpp/src/arrow/ipc/reader.cc Outdated
@niyue niyue force-pushed the bugfix/ipc-batch-meta branch 3 times, most recently from 96df81a to 77c2988 Compare April 24, 2022 03:12
@niyue

niyue commented Apr 24, 2022

Copy link
Copy Markdown
Contributor Author

I rebased the PR onto the latest master branch and resolved the conflict in read_write_test.cc.

@pitrou @lidavidm could you please help to review if there is anything else that needs to be revised? Thanks.

@pitrou pitrou force-pushed the bugfix/ipc-batch-meta branch from 77c2988 to 3c5b3c0 Compare April 25, 2022 14:18

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, let's wait for CI

@pitrou pitrou force-pushed the bugfix/ipc-batch-meta branch from 3c5b3c0 to 18d9995 Compare April 25, 2022 14:23
@pitrou pitrou force-pushed the bugfix/ipc-batch-meta branch from 18d9995 to b01cdb5 Compare April 25, 2022 14:26
@pitrou pitrou closed this in 942f77e Apr 25, 2022
@pitrou

pitrou commented Apr 25, 2022

Copy link
Copy Markdown
Member

Thanks a lot for your contribution @niyue !

@niyue

niyue commented Apr 25, 2022

Copy link
Copy Markdown
Contributor Author

@pitrou @lidavidm thanks so much for the detailed review

@ursabot

ursabot commented May 1, 2022

Copy link
Copy Markdown

Benchmark runs are scheduled for baseline = 52698f3 and contender = 942f77e. 942f77e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️0.0% ⬆️0.36%] ursa-i9-9960x
[Finished ⬇️0.88% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 942f77e5 ec2-t3-xlarge-us-east-2
[Failed] 942f77e5 test-mac-arm
[Failed] 942f77e5 ursa-i9-9960x
[Finished] 942f77e5 ursa-thinkcentre-m75q
[Finished] 52698f35 ec2-t3-xlarge-us-east-2
[Failed] 52698f35 test-mac-arm
[Finished] 52698f35 ursa-i9-9960x
[Finished] 52698f35 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants